This notebook presents an isobaric labeling data analysis strategy that includes data-driven normalization.

We will check how varying the analysis components (summarization, normalization, and differential abundance testing methods) changes the end results of a quantitative proteomics study.

1 Unit component

1.1 log2 transformation of reporter ion intensities

1.2 original scale (not log-transformed) of reporter ion intensities

1.3 log2-transformation of intensity ratios (channel 127C in denominator)
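
The three unit variants above can be illustrated with a small sketch. It is written in Python for illustration only and is not the notebook's own code. The intensities, the channel names other than 127C, and the function names are hypothetical, and dropping the reference channel from the ratio output is an assumption.

```python
import math

# Hypothetical reporter ion intensities of one PSM, keyed by channel.
psm = {"127N": 1200.0, "127C": 800.0, "128N": 1600.0, "128C": 400.0}

def log2_intensities(intensities):
    """Variant 1.1: log2-transform every reporter ion intensity."""
    return {ch: math.log2(v) for ch, v in intensities.items()}

def raw_intensities(intensities):
    """Variant 1.2: keep the intensities on their original scale."""
    return dict(intensities)

def log2_ratios(intensities, reference="127C"):
    """Variant 1.3: log2 of each channel divided by the reference channel.

    Assumption: the reference channel itself is dropped from the output.
    """
    ref = intensities[reference]
    return {
        ch: math.log2(v / ref)
        for ch, v in intensities.items()
        if ch != reference
    }

ratios = log2_ratios(psm)
```

In this toy example channel 128N has twice the intensity of 127C, so its log2 ratio is 1, and channel 128C has half of it, so its log2 ratio is -1.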

2 Normalization component

2.1 CONSTANd
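
As a rough illustration of the idea behind this kind of normalization, the sketch below alternately rescales the rows and columns of a matrix until all row means and column means are close to 1. It is a simplified Python illustration, not the CONSTANd package's implementation: the function name, the order of the row and column steps, the target of 1 (chosen so that row sums equal Ncols, as mentioned later in this notebook), the stopping rule, and the data are all assumptions, and missing values are not handled.

```python
def rake_to_unit_means(matrix, target=1.0, tol=1e-9, max_iter=1000):
    """Alternately rescale rows and columns until all means are near target."""
    data = [list(row) for row in matrix]  # work on a copy
    n_rows, n_cols = len(data), len(data[0])
    for _ in range(max_iter):
        # Row step: scale each row so that its mean equals the target.
        for row in data:
            factor = target * n_cols / sum(row)
            for j in range(n_cols):
                row[j] *= factor
        # Column step: scale each column so that its mean equals the target.
        col_sums = [sum(row[j] for row in data) for j in range(n_cols)]
        for row in data:
            for j in range(n_cols):
                row[j] *= target * n_rows / col_sums[j]
        # The column step may disturb the row means; stop once it no longer does.
        if max(abs(sum(row) / n_cols - target) for row in data) < tol:
            break
    return data

# Hypothetical intensities: 3 PSMs (rows) x 2 channels (columns).
normalized = rake_to_unit_means(
    [[100.0, 200.0], [300.0, 150.0], [50.0, 400.0]]
)
```

After convergence every row of `normalized` sums to the number of columns and every column sums to the number of rows.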

3 Summarization component

Summarize quantification values from PSM to peptide (first step) to protein (second step).

3.1 Median summarization (PSM to peptide to protein)

Notice that the row sums are not equal to Ncols anymore, because the median summarization does not preserve them (but mean summarization does).
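
A toy example makes the remark about row sums concrete. The sketch below is in Python for illustration only, with hypothetical values and a hypothetical helper name. Three normalized PSMs of one peptide each sum to Ncols = 3. They are summarized column by column, once with the mean and once with the median.

```python
from statistics import mean, median

# Hypothetical normalized PSMs of one peptide; each row sums to Ncols = 3.
psms = [
    [0.5, 1.0, 1.5],
    [1.2, 0.9, 0.9],
    [0.8, 1.6, 0.6],
]

def summarize(rows, stat):
    """Apply a summary statistic to every column (quantification channel)."""
    return [stat(column) for column in zip(*rows)]

mean_row = summarize(psms, mean)      # row sum stays at 3
median_row = summarize(psms, median)  # [0.8, 1.0, 0.9], row sum is 2.7
```

The mean is linear, so the sum of the column means equals the mean of the row sums, which is still Ncols. The median has no such property, so the summarized row generally sums to a different value.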

Let’s also summarize the non-normalized data for comparison in the next section.

4 QC plots

Boxplots:

MA plots:

MA plots of two single samples taken from condition 1 and condition 0.125, measured in different MS runs (samples Mixture2_1:127C and Mixture1_2:129N, respectively).

MA plots of all samples from condition 1 and condition 0.125 (quantification values averaged within condition).
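
For reference, the coordinates of an MA plot are M, the difference of the log2 values of two samples, and A, the average of their log2 values. The sketch below is in Python for illustration only. The function name is hypothetical, and it assumes original-scale intensities as input. For data that are already log2-transformed, the log step would be skipped.

```python
import math

def ma_values(sample_x, sample_y):
    """Return one (M, A) pair per protein for two lists of positive intensities."""
    pairs = []
    for x, y in zip(sample_x, sample_y):
        m = math.log2(x) - math.log2(y)        # log2 fold change
        a = (math.log2(x) + math.log2(y)) / 2  # average log2 abundance
        pairs.append((m, a))
    return pairs

# Hypothetical intensities of two proteins in two samples.
points = ma_values([8.0, 4.0], [1.0, 4.0])
```

The first protein is eight times more abundant in the first sample, giving M = 3. The second protein is unchanged, giving M = 0.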

CV (coefficient of variation) plots:

PCA plots:

HC (hierarchical clustering) plots:

TO DO:
- !!! Only the first 1000 records are selected to speed up knitting; remove this in the final version !!!
- Also use short label names, as in the PCA plot.
- Unify the list of arguments across pcaplot.ils and dendrogram.ils. Make sure labeling and color picking are done in the same location (either inside or outside the function).

5 DEA component

5.1 Moderated t-test

TODO:
- Also try to log-transform the intensity case, to see if there are large differences in the t-test results. Done; remove this code?

NOTE:
- lmFit (used in moderated_ttest) was built for log2-transformed data. However, supplying untransformed intensities can also work. The effects in the linear model are then additive on the untransformed scale, whereas for log-transformed data they are multiplicative on the untransformed scale. Also, a bias may arise from biased estimates of the population means in the t-tests, because mean(X) is not equal to exp(mean(log(X))).
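
The inequality between mean(X) and exp(mean(log(X))) is the difference between the arithmetic and the geometric mean. The sketch below is in Python for illustration only, with hypothetical intensities, and shows the gap on a small example.

```python
import math
from statistics import mean

# Hypothetical untransformed intensities of one protein across samples.
x = [100.0, 200.0, 800.0]

arithmetic_mean = mean(x)  # mean(X)
geometric_mean = math.exp(mean([math.log(v) for v in x]))  # exp(mean(log(X)))

# For positive values the geometric mean never exceeds the arithmetic mean;
# the two coincide only when all values are equal.
gap = arithmetic_mean - geometric_mean
```

Here the arithmetic mean is about 366.7 and the geometric mean is about 252.0, so a test on untransformed intensities and a test on log-transformed intensities compare different summaries of the same data.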

6 Results comparison

Confusion matrix:

Confusion matrix for variant: log2Intensity

          contrast  background  spiked
not DEA   0.5                0    4083
DEA       0.5                0      38
not DEA   0.667           4081       2
DEA       0.667             32       6
not DEA   0.125           4077       6
DEA       0.125              9      29
not DEA   1               4073      10
DEA       1                  8      30

contrast     0.5        0.667      0.125      1
Accuracy     0.0092211  0.9917496  0.9963601  0.9956321
Sensitivity  1.0000000  0.1578947  0.7631579  0.7894737
Specificity  0.0000000  0.9995102  0.9985305  0.9975508
PPV          0.0092211  0.7500000  0.8285714  0.7500000
NPV          NaN        0.9922198  0.9977974  0.9980397

Confusion matrix for variant: Intensity

          contrast  background  spiked
not DEA   0.5                0    4083
DEA       0.5                0      38
not DEA   0.667           4080       3
DEA       0.667             29       9
not DEA   0.125           4079       4
DEA       0.125             10      28
not DEA   1               4072      11
DEA       1                  3      35

contrast     0.5        0.667      0.125      1
Accuracy     0.0092211  0.9922349  0.9966028  0.9966028
Sensitivity  1.0000000  0.2368421  0.7368421  0.9210526
Specificity  0.0000000  0.9992652  0.9990203  0.9973059
PPV          0.0092211  0.7500000  0.8750000  0.7608696
NPV          NaN        0.9929423  0.9975544  0.9992638

Confusion matrix for variant: log2Ratio

          contrast  background  spiked
not DEA   0.5                0    4083
DEA       0.5                0      38
not DEA   0.667           4080       3
DEA       0.667             29       9
not DEA   0.125           4079       4
DEA       0.125             10      28
not DEA   1               4072      11
DEA       1                  3      35

contrast     0.5        0.667      0.125      1
Accuracy     0.0092211  0.9922349  0.9966028  0.9966028
Sensitivity  1.0000000  0.2368421  0.7368421  0.9210526
Specificity  0.0000000  0.9992652  0.9990203  0.9973059
PPV          0.0092211  0.7500000  0.8750000  0.7608696
NPV          NaN        0.9929423  0.9975544  0.9992638

Confusion matrix for variant: Intensity_lateLog2

          contrast  background  spiked
not DEA   0.5             4083       0
DEA       0.5               38       0
not DEA   0.667           4080       3
DEA       0.667             32       6
not DEA   0.125           4077       6
DEA       0.125              9      29
not DEA   1               4073      10
DEA       1                  8      30

contrast     0.5        0.667      0.125      1
Accuracy     0.9907789  0.9915069  0.9963601  0.9956321
Sensitivity  0.0000000  0.1578947  0.7631579  0.7894737
Specificity  1.0000000  0.9992652  0.9985305  0.9975508
PPV          NaN        0.6666667  0.8285714  0.7500000
NPV          0.9907789  0.9922179  0.9977974  0.9980397
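
To make the relation between the counts and the metrics explicit, the sketch below recomputes the metrics from the four cells of one contrast. It is in Python for illustration only, and the function and argument names are hypothetical. The formulas are the ones that reproduce the reported values: the row totals (not DEA, DEA) serve as denominators for specificity and sensitivity, and the column totals (background, spiked) serve as denominators for NPV and PPV.

```python
def ratio(numerator, denominator):
    """Divide, returning NaN for an empty denominator (as in the tables)."""
    return numerator / denominator if denominator else float("nan")

def metrics(not_dea_background, not_dea_spiked, dea_background, dea_spiked):
    """Recompute the five reported metrics from the four cell counts."""
    a, b = not_dea_background, not_dea_spiked
    c, d = dea_background, dea_spiked
    return {
        "Accuracy": ratio(a + d, a + b + c + d),
        "Sensitivity": ratio(d, c + d),
        "Specificity": ratio(a, a + b),
        "PPV": ratio(d, b + d),
        "NPV": ratio(a, a + c),
    }

# Counts taken from the log2Intensity tables above.
m_0667 = metrics(4081, 2, 32, 6)   # contrast 0.667
m_05 = metrics(0, 4083, 0, 38)     # contrast 0.5
```

For contrast 0.667 this gives a sensitivity of 6/38 and a PPV of 6/8, matching the table. For contrast 0.5 the background column is empty, so the NPV is 0/0 and is reported as NaN.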

Scatter plots:

Volcano plots:

Violin plots:

Let’s see whether the fold changes of the spiked proteins make sense.

7 Conclusions

8 Session information

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=de_BE.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=de_BE.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=de_BE.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_BE.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] dendextend_1.14.0  CONSTANd_0.99.0    forcats_0.5.0      stringr_1.4.0     
##  [5] dplyr_1.0.2        purrr_0.3.4        readr_1.4.0        tidyr_1.1.2       
##  [9] tibble_3.0.4       tidyverse_1.3.0    kableExtra_1.2.1   psych_2.0.9       
## [13] gridExtra_2.3      RColorBrewer_1.1-2 stringi_1.5.3      limma_3.45.18     
## [17] caret_6.0-86       ggplot2_3.3.2      lattice_0.20-41   
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-149         fs_1.5.0             lubridate_1.7.9     
##  [4] webshot_0.5.2        httr_1.4.2           tools_4.0.3         
##  [7] backports_1.1.10     R6_2.4.1             rpart_4.1-15        
## [10] mgcv_1.8-33          DBI_1.1.0            colorspace_1.4-1    
## [13] nnet_7.3-14          withr_2.3.0          tidyselect_1.1.0    
## [16] mnormt_2.0.2         compiler_4.0.3       cli_2.1.0           
## [19] rvest_0.3.6          xml2_1.3.2           labeling_0.3        
## [22] scales_1.1.1         digest_0.6.26        rmarkdown_2.4       
## [25] pkgconfig_2.0.3      htmltools_0.5.0      highr_0.8           
## [28] dbplyr_1.4.4         rlang_0.4.8          readxl_1.3.1        
## [31] rstudioapi_0.11      farver_2.0.3         generics_0.0.2      
## [34] jsonlite_1.7.1       ModelMetrics_1.2.2.2 magrittr_1.5        
## [37] Matrix_1.2-18        Rcpp_1.0.5           munsell_0.5.0       
## [40] fansi_0.4.1          viridis_0.5.1        lifecycle_0.2.0     
## [43] pROC_1.16.2          yaml_2.2.1           MASS_7.3-53         
## [46] plyr_1.8.6           recipes_0.1.14       grid_4.0.3          
## [49] blob_1.2.1           parallel_4.0.3       crayon_1.3.4        
## [52] haven_2.3.1          splines_4.0.3        hms_0.5.3           
## [55] tmvnsim_1.0-2        knitr_1.30           pillar_1.4.6        
## [58] reshape2_1.4.4       codetools_0.2-16     stats4_4.0.3        
## [61] reprex_0.3.0         glue_1.4.2           evaluate_0.14       
## [64] data.table_1.13.2    modelr_0.1.8         vctrs_0.3.4         
## [67] foreach_1.5.1        cellranger_1.1.0     gtable_0.3.0        
## [70] assertthat_0.2.1     xfun_0.18            gower_0.2.2         
## [73] prodlim_2019.11.13   broom_0.7.1          e1071_1.7-4         
## [76] class_7.3-17         survival_3.2-7       viridisLite_0.3.0   
## [79] timeDate_3043.102    iterators_1.0.13     lava_1.6.8          
## [82] ellipsis_0.3.1       ipred_0.9-9